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Abstract: We present a Bayesian model for pairwise nonlinear registration 
of functional data. We use the Riemannian geometry of the space of warping 
functions to define appropriate prior distributions and sample from the 
posterior using importance sampling. A simple square-root transformation 
is used to simplify the geometry of the space of warping functions, which 
allows for computation of sample statistics, such as the mean and median, 
and a fast implementation of a /c-means clustering algorithm. These tools 
allow for efficient posterior inference, where multiple modes of the posterior 
distribution corresponding to multiple plausible alignments of the given 
functions are found. We also show pointwise 95% credible intervals to assess 
the uncertainty of the alignment in different clusters. We validate this model 
using simulations and present multiple examples on real data from different 
application domains including biometrics and medicine. 
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1. Introduction 

The problem of registration of functional data is important in many branches of 
science. In simple terms, it deals with deciding how points on one function match 
in some optimal way with points on another function. In contrast to landmark- 
based matching, such an approach matches the entire domains of the functions 
in a general registration problem. The study of registration problems is popular 
in image analysis where pixels or voxels across images are matched, and in shape 
analysis of objects where points across shapes are matched. One can broadly 
classify registration problems into two main groups: (1) pairwise registration 
and (2) groupwise registration. In pairwise registration, one solves for an optimal 
matching between two objects, while in groupwise registration multiple (> 2) 
objects are simultaneously registered. In this paper, we focus on the problem 
of pairwise registration. This problem has been referred to in many different 
ways, some of which are alignment, warping, deformation matching, amplitude- 
phase separation, and so on. While registration can be studied for many types 
of objects, from simple functions to complex high-dimensional structures, the 
fundamental issues in registration are often similar. We will focus on perhaps the 
simplest objects for studying registration problems, K-valued functions on [0,1]. 
More specifically, we will take a Bayesian approach to this problem, motivated 
by geometrical considerations; the method will be characterized by the definition 
of a geometric prior on a suitable function space, representing the parameter 
space of interest. We also compare the proposed method to past ideas that often 
take an optimization-based approach. 

To motivate the function alignment problem, consider the example shown 
in Figure 1. In panel (a), we display an example of a PQRST complex with la¬ 
beled structures (P wave, QRS complex, T wave). This function represents a full 
heartbeat cycle and can be extracted from long electrocardiogram (ECG) sig¬ 
nals for the purposes of diagnosing heart diseases such as myocardial infarction. 
The difficulty with using such objects for diagnosis is highlighted in panel (b). 
As given, the P wave and QRS complex on the red function occur earlier than 
on the blue one. This is usually due to natural variability in nonlinear heart¬ 
beat dynamics. In general, given two PQRST complexes, their important salient 
features are often not in correspondence. This presents a major challenge when 
modeling these functions. Even simple statistics such as the cross-sectional mean 
can be meaningless (see Figure 8 and row 3 in Figure 13). Aligning the functions 
prior to subsequent statistical analyses is thus required. The purpose of pairwise 
alignment is to estimate a warping function, and additionally the uncertainty in 
this estimate, that aligns the prominent features across two functions. In panel 
(c), we display the estimated warping function in red, and in panel (d) we show 
the resulting alignment of the two PQRST complexes. Now, the P wave, QRS 
complex and T wave occur at the same time across both functions. 
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Fig 1. (a) A PQRST complex with labeled salient features: the P wave, QRS complex and T 
wave, (b) Two unaligned PQRST complexes, (c) The estimated warping function in red with 
identity warping in blue as a reference, (d) The same PQRST complexes as in panel (b) but 
aligned using the estimated warping. 


There exists a large literature on statistical analysis of functions, in part due 
to the pioneering efforts of Ramsay and Kneip [25, 9], and several others [17, 34]. 
When restricting to the analysis of elastic functions (functions that are tempo¬ 
rally aligned) the literature is relatively recent [24, 7, 17, 8, 34, 10, 21]. The gen¬ 
eral approach in most of these methods is to use an energy function to compute 
optimal registrations and perform subsequent analysis on the aligned functions 
using standard tools from functional data analysis such as the cross-sectional 
mean, covariance and functional Principal Component Analysis (fPCA). The 
importance of registration in functional data is undeniable as evidenced in a 
recent Special Section of the Electronic Journal of Statistics titled Statistics of 
Time Warpings and Phase Variations [19]; this section contained a set of ap¬ 
plied papers that analyzed four different datasets, including mass spectrometry 
functions [11], neural spike trains [42], juggling trajectories [23] and internal 
carotid arteries [27]. 

Recently, it has been argued that a Bayesian approach rather than pure 
optimization is a better option for many situations. The advantages of a model- 
based Bayesian approach include: 

1. A comprehensive exploration of the warping variable space resulting in 
potential multimodal solutions to the registration problem; 

2. Assessment of uncertainty, via credible intervals, associated with the com¬ 
puted estimates. 

The literature on registration methods that are based on Bayesian principles 
is fairly limited. Telesca and Inoue [35] proposed a semi-parametric model for 
groupwise alignment of functional data. These models were further extended in 
the context of analyzing microarray data in [36]. A nonparameteric approach 
to the groupwise registration problem was also proposed recently in [33] . A dif¬ 
ferent Bayesian model was proposed for registering liquid chromatography-mass 
spectrometry data in [37] . The main difficulty in specifying Bayesian registration 
models lies in defining an appropriate prior on the space of warping functions, or 
some relevant subset, to enable efficient inference. In [30], Srivastava and Jermyn 
defined a Gaussian-type prior distribution on the space of warping functions, 
via the geodesic distance, in the context of detecting shapes in two-dimensional 
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point clouds. The recent model of Cheng et al. [2] used the square-root slope 
function (SRSF) representation of functional data and utilized the fact that 
the derivative of a warping function is a probability density function. In this 
way, they constructed a Dirichlet process to impose a prior model implicitly 
on the space of warping functions, and sampled from the posterior distribution 
using Markov chain Monte Carlo (MCMC) techniques. The SRSF representa¬ 
tion of functional data has many desirable properties related to the registration 
problem, which we emphasize in Section 2. 

In the current paper, we describe a convenient geometric structure, a unit 
sphere, using the square-root density (SRD) representation of warping functions 
and use its geometry to impose the prior. In this setup, we develop a Bayesian 
registration model and utilize importance sampling from the posterior to com¬ 
pute posterior functionals such as the mean, median or maximum a posteriori 
(MAP) estimate. We also provide pointwise standard deviations and credible 
intervals to assess alignment uncertainty. We show that these tools are espe¬ 
cially effective when two or more registrations are plausible. Thus, the main 
contributions of this paper are the following: 

1. We use the spherical geometry of the space of warping functions to define 
a class of truncated wrapped normal prior distributions for the purpose of 
Bayesian alignment of functional data; 

2. We define a sampling importance re-sampling approach to sample from 
the marginal posterior distribution of warping functions; 

3. We use the Riemannian geometry of the space of warping functions to 
define an efficient k -means clustering algorithm, which can be used to 
identify multiple modes in the posterior representing different plausible 
alignments of the observed functions. 

The rest of this paper is organized as follows. In Section 2, we give a detailed 
description of the registration problem and describe tools for statistical analysis 
on the space of warping functions. In Section 3, we introduce our registration 
model and in Section 4 we describe an importance sampling approach for sam¬ 
pling from the posterior distribution of warping functions. Finally, in Sections 5 
and 6 , we present simulation studies and different applications of the proposed 
framework. We emphasize examples where the posterior distribution is multi¬ 
modal. Finally, we close with a brief summary and directions for future work in 
Section 7. 

2. Problem Background 

Before we describe our Bayesian framework, we first setup the registration prob¬ 
lem mathematically. Let T be an appropriate subset (made precise later) of 
real-valued functions on the interval [0,1]. For any two functions /i ,/2 £ F, 
the registration problem is defined as finding the mapping 7 such that point 
t £ [ 0 , 1 ] on the domain of fi is matched to the point 7 (t) £ [ 0 , 1 ] on the 
domain of / 2 - In other words, the functions fi{t) and are optimally 
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matched under the chosen optimality criterion. The main question that arises 
is: What should be the criterion for optimal registration? A natural tendency 
is to choose an L p -norm between /i and / 2 o 7 , but there are some known lim¬ 
itations of that approach. For instance, if we choose the L 2 norm, defined as 

Il/i ~ /s|| = \jJ q \fi(t) — f 2 {t)\ 2 dt (| • | is the standard Euclidean norm), we 
obtain the following optimization problem: 

7 * = arginf H/i -/ 2 o 7 ||. ( 1 ) 

7 

This setup can lead to a degenerate solution, termed the pinching effect demon¬ 
strated in [20]. In this case, one can pinch the entire function / 2 to get arbitrar¬ 
ily close to fi in L 2 norm. To avoid this situation, one often adds a roughness 
penalty on 7 , denoted by 7 ?.( 7 ), leading to the optimization problem given by 
7 * = arginf 7 (||/i — /2 0 7|| 2 + A 7 \l( 7 )). Although this avoids the pinching effect, 
it introduces some other issues. First, the choice of A is not obvious in general 
cases. Second and more important is the fact that this solution is not symmetric. 
That is, the optimal registration of fi to / 2 can be quite different from that of / 2 
to fi. Another related issue is that this criterion is not a proper metric and this 
leads to additional problems in later analysis. Most papers on registration of 
functional data involve this setup and inherit the above-mentioned limitations. 

To avoid these issues, Srivastava et al. [32, 13] proposed an approach that 
has its foundations in differential geometry. First, let the set of all registration 
or warping functions be defined as T = {7 : [0,1] —► [0,1] | 7 ( 0 ) = 0, 7 ( 1 ) = 
1, 0 < 7 < 00}. r forms a Lie group under composition, i.e., for any 71,72 £ F 
their composition 71 o 72 is also in F, and for any 7 £ T there is a unique 
7 -1 £ T. The 7 id {t) = t is the identity element of this group. The next item is 
to represent the given functions by their square-root slope functions (SRSFs): 

q{t) = sign(/(t))y^|/(t)|. Note that the SRSF is the one-dimensional version of 
the square-root velocity function used for shape analysis of higher-dimensional 
curves [31, 12]. 

For registration under this approach, each / £ T is represented by its SRSF 
q. One sets T to be the space of all absolutely continuous functions and the 
resulting space of all SRSFs is L 2 ([0,1],R) henceforth referred to simply as L 2 . 
For every q £ L 2 there exists a function / (unique up to a constant) such that 
the given q is the SRSF of that /. In fact, this function can be obtained precisely 
using f(t) = /(0) + f Q g(s)|g(s)|ds. Note that if a function / is warped by 7 to 
/ 07 , its SRSF changes from q to (q,y) = (go 7 )^ 7 ; this last term involving 
y/j is an important departure from previous solutions. To setup the registration 
problem, we define an equivalence class of an SRSF as [q] = {(<7, 7 )|7 £ L}. 
Finally, the pairwise registration between any two functions fi and / 2 is per¬ 
formed by solving an optimization problem over equivalence classes of their 
SRSF representations: 

7dp = arg inf ||gi-(g 2 , 7 )||. (2) 

7EI 

The solution to this problem is computed using the dynamic programming (DP) 
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algorithm. The resulting distance between the aligned /1 and /2 is given by 

[qi]) = Iki - (q2,iDp)\\- 

As described in [32] , this framework has many advances: it avoids the pinching 
problem, its registration solution is symmetric, it does not require an additional 
regularization term and the choice of A that goes with it, and it is actually a 
proper metric on the quotient space JF/T, which provides important tools for 
ensuing analysis. The most important reason why this setup avoids many prob¬ 
lems of Equation 1 is that ||qi — < 72 1| = ||(< 7 i, 7 ) — (< 72 ,t)II for any 7 £ T. In 
mathematical terms, it means that the action of T on L 2 is by isometries. The 
original method was later extended to apply to statistical analysis of cyclosta¬ 
tionary biosignals [15], and was shown to perform well in different applications 
[38, 39, 43, 19], 

While the framework of Srivastava et al. [32] is precise in mathematically 
defining the function registration problem, it solves for optimal warping func¬ 
tions via energy optimization. In this paper, we argue that a model-based 
Bayesian approach has many additional advantages. Thus, to preserve the nice 
properties, such as the isometric action of T under the L 2 metric, we build our 
Bayesian model using the SRSF representation of functional data. 

2.1. Representation Space of Warping Functions 

The proposed Bayesian model defines prior distributions and importance func¬ 
tions on the space of warping functions T. Thus, we are faced with defining 
statistics and probability distributions on this space. In order to do this we use 
the Fisher-Rao Riemannian metric on T, which is given by (for w±, W 2 £ T 7 (T) 
and 7 £ T) [29, 31, 12]: 



( 3 ) 


where w and 7 represent derivatives. An important property of the Fislier- 
Rao metric is that it is invariant to re-parameterizations of probability density 
functions [41]. While this is not the only metric that achieves this property, 
it is important to note that there is no invariant metric that does not include 
derivatives. It is possible to define statistics and probability models directly 
on T under the Fisher-Rao metric, but this proves to be very complicated due 
to the non-trivial Riemannian geometry of this space. We use the Fisher-Rao 
Riemannian geometry in our Bayesian setup because the desirable properties 
of this metric (i.e., parameterization invariance) will naturally translate to the 
prior distributions on T. 

Inference on T is greatly simplified using a convenient transformation, which 
is similar to the definition of the SRSF for general functions [1] . 

Definition 1. Define the mapping <p : T —> 4/. Then, given an element 7 £ 
T, define a new representation ip : [0,1] — > K>o using the square-root of its 
derivative as <p{ 7) = ip = y/j. 
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This is the same as the SRSF defined earlier for functions and takes this form 
because 7 > 0 V t. For simplicity and to distinguish it from the SRSF represen¬ 
tation of observed functions, we refer to this representation as the square-root 
density (SRD). The identity map 7 idit) = t maps to a constant function with 
value = 1. An important advantage of this transformation is that the 

L 2 norm of a function ip is f. Thus, the set of all such ips, denoted by T, is a 
subset of the unit sphere in L 2 . Furthermore, as shown in [1, 29, 31, 12], the 
Fisher-Rao metric on the space of warping functions simplifies to the L 2 metric 
on T, which in turn greatly simplifies all computation. Given a function ip one 
can easily compute the corresponding warping function via integration using 
7 (t) = f Q ip(s) 2 ds; this provides the inverse mapping </> -1 : T —>• T. Thus, the 
geodesic path between two warping functions, 71 , 72 £ F represented using their 
SRDs ipi, tp 2 £ 'F, is simply the great circle connecting them (a : [0,1] —>• H/), 
a(r) = [sin(fl — 9r)ipi + sin(0r)^ 2 ], where 9 represents the length of this 
path (geodesic distance between warping functions 71 and 72 under the Fisher- 
Rao metric) and is simply the arc-length between ip\ and ip2'. 

dFR{ 71 ) 72 ) = d(ip!,ip 2 ) = 9 = cos - 1 ((^i, 1 P 2 ) )> (4) 

where (•, •) is the standard L 2 inner product. 

Since the differential geometry of the sphere is well known, this transforma¬ 
tion also simplifies the problem of defining probability distributions of warping 
functions. The general approach will be to define wrapped probability distri¬ 
butions, and perform random sampling and probability calculations on tangent 
spaces of T; the tangent space for all ip G ^ is defined as ( 'F ) = {u : [0,1] —► 
K|(u,'i/’) = 0}. In order to achieve this goal, we must first define some standard 
tools from differential geometry for this space: 

1 . Exponential map: For ip £ 4 / and v £ T^,(\ F), the exponential map is 
defined as exp : T^(\ F) —>■ 4/ by exp^,(v) = cos(||+ sl "|^'^ u. 

2. Inverse exponential map: For ipi, b 2 G \F, the inverse exponential map 

is defined as exp -1 : T by exp^ 1 ^) = W ’2 - cos( 0 )yi>i). 

3. Parallel transport: For tp 1: ip 2 £ 4/, the shortest geodesic path a : 
[ 0 , 1 ] —► U/ such that o( 0 ) = ipi and a(l) = tp 2 , and a vector v £ T^ 1 (\ k), 
its parallel transport along a to ip 2 is defined as k : 7)/, 1 ('F) —> T^ 2 (T) by 

k ( v ) =v ~ \\il+£w (^ 1 +v>2)- 

The exponential and inverse exponential maps provide a natural way of mapping 
points from the representation space T to the tangent space (at a particular 
element of 4/) and vice versa. Parallel transport long geodesic paths allows 
translation of tangent vectors from one tangent space to another. An important 
property of parallel transport is that the mapping n is an isometry between 
the two tangent spaces, i.e., for 17 , u 2 £ T^ > 1 (\k), (ui,u 2 ) = (tt(vi ),k(u 2 )). This 
tool from differential geometry is useful in defining probability models on the 
space of warping functions. In particular, we define an orthonormal basis in the 
tangent space at any point on 'F by transporting a standard basis defined on 
the tangent space at the identity element, Ti(’P). 
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Summary statistics on 'f: In addition to defining prior distributions on 
the space of warping functions, we would like to be able to compute summary 
statistics such as the mean or median. These tools are especially useful in infer¬ 
ence based on samples generated from the posterior distribution. Suppose that 
we have a sample of warping functions 71 ,... , 7 p . To begin, we are interested 
in defining a mean and median of these functions. To do this we again exploit 
the geometry of We begin by mapping all of the functions 7 to their corre¬ 
sponding SRD representations resulting in ifi,... , 1 <p p . Once this is done, all of 
our data is on the subset of the unit sphere, where the geodesic distance is used 
to compute their intrinsic mean and median as follows. The sample Karcher 
mean is given by fp = argmku g ^ 2f =1 C KV’, V’i ) 2 while the sample geometric 
median is defined as if = argmin^g,^ d{if,ipi). Gradient-based approaches 
for finding the Karcher mean and geometric median are given in several places 
[16, 5, 6 , 14] and are omitted here for brevity. 

K-means clustering on 47 One of the motivations behind this work is the 
discovery and analysis of multiple modes in the posterior distribution of warping 
functions. For this purpose, we introduce a fc-means clustering approach on 4/. 
In the previous section, we defined a procedure to compute the Karcher mean of 
warping functions and we will use it to specify the fc-means clustering algorithm. 
Let 71 ,.. ., 7 P be a sample from the posterior distribution and ipi,... ,ip p be their 
corresponding SRDs. The fc-means clustering approach computes a partition 
of the sample space such that the within cluster sum of squared distances is 
minimized. This is achieved using the following standard algorithm [18]: 

Algorithm 1. (k-Means): Initialize using k unique functions ififi, • ■ ■ I'tpkfi 
as cluster centers and set j = 0. 

(1) For each i = 1,... ,p and m = 1,..., k, compute di^ m = ifi) using 

Equation 4- 

(2) Assign each function ipi, i = 1,... ,p, to the cluster which minimizes di t . . 

(3) Update cluster means i/qj+i,... ,ipk,j +1 using the Karcher mean. 

(4) Setj=j + 1. 

(5) Repeat Steps 1~4 until cluster assignments remain unchanged. 

A major benefit of this algorithm is its flexibility. One can easily replace the 
fc-means formulation by, for example, fc-medians. This is especially useful when 
the mean may not be a good estimate of the posterior mode of interest. 

There are two main limitations of this algorithm: (1) the solution strongly de¬ 
pends on the initialization of the k cluster means, and ( 2 ) the number of clusters 
k must be specified a priori (usually the expected number of posterior modes 
is unknown). We address the first issue using hierarchical distance-based clus¬ 
tering as follows. To overcome limitation (1), we compute all pairwise distances 
between the given samples using Equation 4 and perform hierarchical clustering 
using the maximum linkage criterion. We then initialize the k- means clustering 
algorithm using the k clusters provided by hierarchical clustering. To address 
the second issue, we use the following procedure to determine the “correct” 
number of clusters or posterior modes k. First, we compute the pooled total 
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variance across all clusters for k = 1 and k = 2. To decide whether the posterior 
has multiple modes, we examine the percentage decrease in the pooled variance 
due to the additional second cluster. If the percentage decrease is greater than 
30%, we proceed to cluster the posterior samples. While this cutoff value seems 
ad-hoc, we have found through many simulations and real data examples that 
it works well in practice. Then, to decide on the final number of clusters, we use 
the silhouette measure of Rousseeuw [26]. To construct the silhouette for warp¬ 
ing function i, we require the following two values: (1) a(i), which is the average 
dissimilarity of warping function i to all other warping functions in the same 
cluster, and ( 2 ) b(i), which is the minimum average dissimilarity of warping 
function i to any of the other clusters; we use the Fisher-Rao distance (Equa¬ 
tion 4) as the dissimilarity measure. Then, the silhouette can be calculated as 
s(i) = ma x{l 7 q^b(i)} • The silhouette for a given warping function measures the 
appropriateness of its cluster assignment. The average of the silhouette measures 
over all posterior warping function samples can take values between -1 and 1 , 
which represent very poor and very good clusterings, respectively. The number 
of modes in the posterior is chosen as the number of clusters, which maximizes 
the average silhouette measure. 

Discretization: To define the Bayesian registration model, we first dis¬ 
cretize the observed functional data using a dense sampling of N points: [£] = 
{fi,..., tw} £ [0,1], where N depends on the application of interest. We study 
the effects of different values of N on the posterior inference in Section 5.1. This 
allows us to model differences between SRSFs using multivariate normal distri¬ 
butions. Note that the function / evaluated at the N discrete points is denoted 
by f{[t]) (similarly g([f]) for the SRSFs). As will be seen later, the warping 
functions do not require an explicit discretization in the given model. But, in 
order to compute the action of T on the observed functions (SRSFs), we also 
discretize them with the same N points in the implementation. Finally, we use 
discrete approximations to compute the quantities defined in this section. 

3. Bayesian Registration Model 

Given two functions /i, fi and their corresponding SRSFs gi, q 2 , we introduce 
a novel Bayesian model for function registration. Let denote (q 2 o (p~ 1 (tp)) , ip. 
At the first stage, we model the difference qi — gJIV' using a zero-mean Gaus¬ 
sian process. After discretization of the observed functions, we model the N 
differences gi[t] — g|[£]|^ using the multivariate normal distribution as follows: 

gi[t] — q^\t} | if), k ~ MVN(O n , ^-In) (likelihood denoted by L). (5) 

2 K 

This part of our model is exactly the same as that proposed in [2]. 

The second stage of our model places a truncated wrapped normal (TWN) 
prior distribution on the space of warping functions T by using their SRD rep¬ 
resentation: 

tp ~ TWN^,(fj,^,, £,/,) (denoted by 7 t^). ( 6 ) 
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We set the mean of the prior to be the identity mapping = 1, which provides 
natural regularization toward 7 id (he., no warping). We also truncate the sup¬ 
port of the prior to the valid space of warping functions given by ’k. Thus, the 
prior distribution 7 r^ is a truncated wrapped normal distribution defined and 
evaluated in Ti(T). This definition is similar to that presented in Kurtek et al. 
[12]; an alternative construction of Gaussian distributions on high-dimensional 
spheres is given in [4]. 

To define the covariance structure in the prior on warping functions, we re¬ 
quire an orthonormal basis in the tangent space T , i(’3/). We begin by defining a 
set of basis elements, which are orthogonal to the representation space and have 
unit L 2 norm: B\ = {\/3(l — 2 1), -\/2sin(27r It), \^2 cos(27 t It) | Z = 1,2,... n}. 
Then, to form an orthonormal basis for the tangent space Ti (T), we use the 
Gramm-Schmidt procedure under the L 2 metric. Notice that this orthonormal 
basis, denoted by B, is truncated by choosing a maximum number l = n, which 
yields 2n+l basis elements denoted by b. The truncation of the basis is important 
for additional regularization (smoothness of the warping functions) and compu¬ 
tational efficiency. Given an orthonormal basis in the tangent space T) ('I'), one 
can approximate any warping function using a set of basis coefficients given by 
tp k, c = {cj = (exp^ 1 ('i/>), bj), j = 1,..., 2n + 1}. Using this notation, we can 
write the truncated wrapped normal prior on warping functions as follows: 

7 r^(- 0 |l ,K) oc exp(-ic T KT _ 1 c)l^, (7) 

where 1 is the indicator function. We specify K as a diagonal covariance matrix 
with (j 2 /j 4 as the j th diagonal element with a large value for er 2 = 1000. Thus, 
we assume a weakly informative prior distribution on the directions given by the 
basis B. We choose quadratic decay of the standard deviation with respect to 
the degree of the basis functions based on simulations presented in Section 5.1. 
We require at least a linear decay for the eigenvalues of the covariance operator 
to be summable [3]. In practice, we want to favor smoother warping functions; 
thus, we weigh the low frequency basis elements (corresponding to low values of 
j ) higher than the high frequency basis elements; the variance of the additional 
linear basis element is not penalized. 

To model the concentration parameter in the likelihood, re, we use a vague 
gamma prior with parameters a = 1 and /3 = 0.01 (U(re) = 100, U(re) = 10000). 
This prior is denoted by 7 r K . We assume that the registration variable, ip, and 
the concentration in the likelihood, re, are independent. This is a reasonable 
assumption due to the fact that the alignment of two functions does not depend 
on their scale as shown in [32]. 

Under this specification of the model, the marginal posterior distribution of 
ip becomes: 

/»oo 

p(^\Qi,Q 2 ) cx / L{qx[t\ - q* 2 [t\\ip, re) 7 r^(^) 7 r K (re)dre 
Jo 

fw 2 + 1) _, n 

* ( 0 . 01 + |gi 
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where F denotes the gamma function. We will use importance sampling to sam¬ 
ple form the posterior distribution and perform Bayesian inference. 

3.1. Model Justification 

Here, we give a brief justification for each component of the proposed Bayesian 
registration model. In particular, we focus on the advantages of the given model 
over other possible choices. 

Likelihood: In the current work, we specify the likelihood as a multivariate 
normal distribution on the pointwise differences between two SRSFs represent¬ 
ing the observed functions. An alternative approach that is common in current 
literature is to model the pointwise differences between the observed functions 
themselves. Unfortunately, this suffers from the drawbacks discussed in detail 
in Secton 2. In particular, it is clear that, under that setup, the likelihood 
changes depending on whether one is aligning /2 to f\ or vice versa. This is 
a direct result of the lack of isometry of the L 2 metric under the action of T, 
i.e., ||/i o 7 — /2 ° b|| / ||/i ~ /all- See [2] for further justification of the given 
likelihood. 

Prior on T: We model the warping functions using a truncated wrapped 
normal distribution on the SRD space. This allows us to avoid discretizing the 
warping functions in the specification of the model (we only discretize at the 
final implementation stage), which is in contrast to the method presented by 
Cheng et al. [2]. In that work, the authors observe that warping functions are 
akin to cumulative distribution functions. Thus, they place a Diriclrlet prior on 
increments of the discretized warping functions. In contrast, we use a basis on 
the SRD tangent space, which allows us to model the full warping function up 
to the level of basis truncation (the warping function can be easily evaluated at 
any point on the domain [0,1] using the given basis). The proposed approach 
also permits one to easily incorporate prior knowledge into the model. First, 
the prior can be defined in a tangent space centered at any warping function. 
The given basis can be parallel translated using the simple expression in Sec¬ 
tion 2.1 to aid in this definition (see the next section for details). This can be 
especially useful if the observed functions are annotated with landmarks. Sec¬ 
ond, the prior knowledge about smoothness of the warping functions can be 
incorporated through the level of basis truncation. For smooth warping func¬ 
tions, the basis can be truncated at a relatively small number (and vice versa 
for “rougher” warpings). Finally, we are able to control the variance and decay 
in the diagonal covariance K , allowing further flexibility in the model. 

Prior on k: We choose the standard gamma prior on the concentration 
parameter k. The main advantage of this choice is that we are able to analytically 
marginalize the posterior over this parameter. This simplifies the importance 
sampling approach discussed in the next section. 
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4. Importance Sampling 

We begin by briefly introducing the concept of importance sampling and then 
provide some details of how this can be applied to our problem. Importance 
sampling is a variance reduction technique in Monte Carlo estimation where 
instead of directly sampling from a distribution of interest, which may be ineffi¬ 
cient, one first samples from an importance function and then re-samples based 
on appropriate weights. 

Suppose that we are interested in estimating the value of the following in¬ 
tegral: 9 = J x g(x)p(x)dx 1 where p is a probability density function. The clas- 
sical Monte Carlo estimate of this integral is given by 6 = d( x i)i where 
{aq,..., £s} are iid samples from p. If the variance of the classical Monte Carlo 
estimate is large it may be beneficial to introduce a new function h, termed the 
importance function , which can be used to generate the samples instead of p. 
One can then rewrite the integral as 9 = f x h{x)dx. The improved Monte 

Carlo estimate becomes 9 = JT =1 9( x i) w ( x i)i where {aq,..., are iid sam¬ 
ples from h and w(x) = We use this idea to generate samples from the pos¬ 
terior distribution represented by p as follows. Given a large sample {aq,..., xs} 
from h, we compute the associated weights as i = 1,..., S}. Then, to 

obtain s samples from p (where s <C S), we re-sample the set {aq,..., xg} 
with the corresponding (normalized) weights. This provides a flexible and effi¬ 
cient method for sampling from the posterior distribution. This process is also 
called sampling importance re-sampling (SIR). In the current work, we use an 
improved SIR method without replacement given in [28]. 

For our problem, we are faced with defining an importance function h that 
allows us to efficiently sample from the posterior p. The main requirement on 
h is that its support is the same as that of p. One option is to use the prior as 
the importance function directly, and generate the weights using the likelihood. 
But, in other cases, one may want to “upsample” a different part of the space, 
e.g., near the dynamic programming solution. Thus, we provide a general recipe 
for constructing wrapped normal importance functions similar to the definition 
of the prior on 4b 

In order to do this, we require a method for defining an orthonormal basis 
in the tangent space at any point on 4'. Given the truncated basis B in T 1 (T) 
defined in the previous section, we can define an orthonormal basis in the tangent 
space at an arbitrary point 7)^(4') using parallel transport, which was defined 
in Section 2.1. Parallel transport defines an isometric mapping between tangent 
spaces, and thus preserves the lengths of the basis vectors and the angles between 
them. We refer to the orthonormal basis in T ^. (4/) as B (with elements { b *., k = 
1,..., to}), and use it to define a coordinate system in that space. Thus, we can 
again approximate any warping function using a set of basis coefficients given 
by if « d = {dk = (exp~J(i p), bk), k = 1,..., to}. In this way, we can define a 
general version of the importance function as: 

h(%f\^,K h ) oc exp(--d T /v“ 1 d), (9) 
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Fig 2. We define wrapped normal importance functions in the tangent space at a pre-specified 
mean. One can generate random samples from these models on the tangent space and then 
use the exponential map to get a random warping function. 


where Kh is a diagonal matrix (Kh can be specified in the same was as in 
the prior). Note that there is no need to truncate the importance function. 
Figure 2 provides a pictorial explanation of our definition of the wrapped normal 
importance function in the tangent space. Under this setup, we can generate 
random samples from h in using the following procedure: 

1. For k = 1,..., m, sample Zk 1V(0,1); 

2. For k = 1,..., to, compute the random basis coefficients d T ^ ld = Zk\/Kh(k), 
where Kh(k) denotes the fcth diagonal element of the matrix I\h ; 

3. Form the random tangent vector using the basis B as v rnd = Y^k= i 

4. Map v rnd to T using tp rnd = exp fl ^(v rnd ); 

5. Compute the random warping function using r ) rnd = c/> _1 (i/j rnd ). 

Using the idea of importance sampling, we can re-write the posterior distri¬ 
bution in Equation 8 as follows: 


P(t%i,92) oc 


f(N/2 + 

(o.oi + \ qi [t]- q *m N/2+i Hipy 


( 10 ) 


It is obvious from the expression in Equation 10, that in the special case when 
the importance function is the same as the prior, one can simply sample from the 
prior distribution and weight each sample using the integrated likelihood. Thus, 
our approach is to generate a large sample {i/’ij..., 'i/’s} from h and evaluate a 
weight for each sampled warping function using: 


Vi = 


f{N/2 + 1) 


(0M + \ qi [t}- q * 2 [t]\ 2 ) N/2+1 


exp (—-(c 7 K 1 c—d 1 K h 1 d))l^, i = l,...,S, 


where c = {cj = (expj^ (ip),bj), bj £ B , j = 1,...,2 n+ 1} and d = {dj = 
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Fig 3 . (a) Simulated data with f in blue, f o 71 in red, f o 72 in green, and f o 73 in black. 
(b)-(d) True warpings 71, 72 and 73 in blue, respectively, with 100 posterior means in red. 


(exp bj £ B, j = 1,..., m} as before. Once all of the weights have 
been computed, we re-sample a small number s of ip s from the original set using 
the methods proposed in [28] . The re-sampled functions ipi,... ,ip s are samples 
from the posterior distribution p, and can be mapped to their corresponding 
warping functions using <p'. Posterior functionals can be mapped to T in the 
same way. 

5. Simulation Studies 

In this section, we present warping results using simulated scenarios. In all 
examples, we fix the original sample size to S = 500000, the posterior sample 
size to s = 200 , and the number of basis elements in the prior and importance 
function to N — 1, where N is the sampling density of the observed functions. 
The importance function used throughout the simulation studies and the real 
applications is a wrapped normal centered at the identity element with the same 
covariance structure as the prior. 

5.1. Simulation 1 

In the first simulation study, we consider the effects of function sampling density 
and the order of decay of the standard deviation in the prior distribution. For 
this purpose, we simulated three different warping functions, 71 = f+0.15t(l— t), 
72 = t + 0.70f(l — t), 73 = t + 0.1 sin(27r<), t = [0,1], and applied them to a 
function with two modes denoted by /. We display the original function / in 
Figure 3(a) in blue and the same function under the three warpings, / 07 ^ /072 
and / o 73 , in red, green and black, respectively. 

We apply the proposed model to perform pairwise Bayesian alignment for 
each example using 100 replicates, and report the detailed results in Table 1 
for quadratic decay of the prior standard deviations and sampling densities of 
50, 100 and 150 points. For each example, we report the average Fisher-Rao 
distance between the true warping function and the estimated posterior mean 7 
in panel (a), the Fisher-Rao distance between the true warping function and the 
dynamic programming solution 7 up in panel (b), and the average Fisher-Rao 
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Table 1 

Simulation results for correct warping recovery for three different warping functions and 
sampling densities (SD) under quadratic decay of the prior standard deviations, (a) Average 
dFR ( 7 t, 7 ) with the standard deviations in parentheses (T=true). (b) dpp^p ^dp) ■ ( c ) 
Average dpp("/x, ^dir) with the standard deviations in parentheses, (d) Average DPDp m 
(PM= proposed method), (e) DPDp>p. (f) Average DPDpjp. Best results are in bold. 


Ex. 

SD 

(a) 

(b) 

(c) 

(d) 

(e) 

(f) 


50 

0.0055 (0.0008) 

0.0359 

0.0145 (0.0024) 

95.4 

85.4 

93.6 

71 

100 

0.0018 (0.0004) 

0.0366 

0.0200 (0.0023) 

98.0 

88.3 

89.8 


150 

0.0018 (0.0007) 

0.0384 

0.0243 (0.0027) 

97.8 

88.8 

87.4 


50 

0.0124 (0.0029) 

0.0312 

0.0373 (0.0030) 

98.2 

96.5 

94.6 

72 

100 

0.0131 (0.0032) 

0.0211 

0.0436 (0.0037) 

98.3 

98.1 

93.4 


150 

0.0125 (0.0030) 

0.0182 

0.0480 (0.0032) 

98.4 

98.3 

92.2 


50 

0.0234 (0.0027) 

0.0284 

0.0707 (0.0033) 

96.0 

95.9 

88.1 

73 

100 

0.0258 (0.0029) 

0.0212 

0.0719 (0.0035) 

94.8 

97.0 

87.5 


150 

0.0259 (0.0028) 

0.0192 

0.0759 (0.0035) 

95.0 

97.1 

86.7 


Table 2 

Simulation results for correct warping recovery for three different warping functions and 
sampling densities (SD) under linear and no decay of the prior standard deviations, (a) 
Average c(f.r( 7 t, 7 ) with the standard deviations in parentheses (T=true). (b) Average 
DPDpm (PM=proposed method). 




Linear Decay 

No Decay 


Ex. 

SD 

(a) 

(b) 

(a) 

(b) 


50 

0.0060 (0.0007) 

94.9 

0.0084 (0.0008) 

92.9 

71 

100 

0.0028 (0.0006) 

97.3 

0.0090 (0.0008) 

89.3 


150 

0.0029 (0.0006) 

96.6 

0.0104 (0.0006) 

83.7 


50 

0.0176 (0.0033) 

97.4 

0.0389 (0.0035) 

94.1 

72 

100 

0.0200 (0.0035) 

97.4 

0.0559 (0.0029) 

91.4 


150 

0.0202 (0.0035) 

97.2 

0.0674 (0.0028) 

85.6 


50 

0.0320 (0.0033) 

94.6 

0.0685 (0.0039) 

88.4 

73 

100 

0.0347 (0.0031) 

93.1 

0.0875 (0.0032) 

81.7 


150 

0.0355 (0.0038) 

93.1 

0.1012 (0.0029) 

75.4 


distance between the true warping function and the estimated posterior mean 
when using a Dirichlet prior 7 dir in panel (c). In all of the presented results, 
we set the parameters of the Dirichlet distribution to ol \ = • • • = 070 = 1 (i.e., 
uniform prior on warping functions specified in the same was as in [ 2 ]), and 
use importance sampling to sample from the posterior. The standard devia¬ 
tions of the distances are also provided in parentheses. We highlight the best 
performance for each example and sampling density in bold. In all examples, 
the proposed geometric Bayesian model outperforms a model with a Dirichlet 
prior on the warping functions. Furthermore, the performance of the proposed 
method is comparable to, and often better than, the commonly used dynamic 
programming algorithm. 

In panels (d)-(f), we report the average percentage decrease in the distance 
between the two functions being registered, i.e., DPD = • 

Again, the proposed model performs very well according to this metric. It is 
important to note that the gains in performance are small when the sampling 
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Fig 4. (a) Simulated data with fa in blue, fa in red, fa o 71 in green, fa o 72 in black and 
fa o 7 _dp in magenta, (b) Estimated average warping functions in cluster 1 , ( 71 ) in green, in 
cluster 2 (fa ) in black, and using dynamic programming (7 bp ) in magenta. 


Table 3 

Clusterwise summaries of the posterior distribution, (a) Average cluster size, (b)-(d) 
Average L 2 distance between q\ and 52 after warping using the mean, median and MAP of 
each cluster, respectively. The standard deviations are given in parentheses. 



(a) 

(b) 

(c) 

(d) 

Cluster 1 
Cluster 2 

98.95 (6.48) 
101.05 (6.48) 

1.5016 (0.0143) 
1.5071 (0.0141) 

1.3769 (0.0150) 
1.3784 (0.0163) 

1.3020 (0.0474) 
1.3028 (0.0475) 


density is increased from 100 to 150 points. Thus, for fairly smooth functions, as 
is the case in this simulation and the applications presented in the subsequent 
section, we will sample the functions with 100 points for computational effi¬ 
ciency. The replicate posterior means for the proposed method are displayed in 
red in Figure 3(b)-(d) with the true warping in blue. It is clear from this figure 
that there is little variation across replicates and that we are able to recover the 
true warping very well. 

Table 2 reports the same set of results for linear and no decay in the prior 
standard deviations for the proposed method across the three sampling densities. 
Linear decay performs comparably to quadratic decay, while no decay does not 
perform well as expected. Throughout the rest of the paper we utilize quadratic 
decay as indicated by these simulation results. 

5.2. Simulation 2 

In the second simulation, we explore the performance of the proposed align¬ 
ment model when two modes are present in the posterior distribution. The 
two functions to be aligned, f\ and / 2 , are shown in Figure 4(a) in blue and 
red, respectively. In the same panel, we show the alignment results, across 100 
replicates, using the mean of each posterior cluster in green and black. For com¬ 
parison, we also display the dynamic programming result in magenta. Note that 
in this simulation we have treated the number of clusters as known (k = 2) and 
applied the fc-means clustering algorithm as described in Section 2.1. In panel 
(b), we display the two clusters of warping functions representing the two pos¬ 
terior modes (again in green and black) as well as the dynamic programming 
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result (in magenta). The clusterwise posterior mean warping functions are much 
smoother than the dynamic programming solution and achieve essentially the 
same level of alignment between the two functions. 

In Table 3, we provide a few summaries for each posterior cluster. In partic¬ 
ular, we report the average cluster size, and the average distance between the 
two functions based on clusterwise posterior mean, median and MAP alignment. 
We expect the clusters to be balanced as the peaks in the bimodal function are 
approximately equidistant from the peak in the unimodal function. This should 
also be reflected in the post-alignment, clusterwise distances between the two 
functions. The original distance between them is 2.6668, and the distance after 
dynamic programming alignment is 1.4221. The reported clusterwise distances 
are comparable to the dynamic programming solution when using mean warping, 
and better when using median and MAP warping. This shows that in addition 
to being able to discover multiple plausible alignments as modes of the posterior 
distribution, we are able to better explore the full space of warping functions 
than the deterministic dynamic programming algorithm. 

6. Applications 

Next, we consider pairwise alignment of functions using the proposed Bayesian 
model for various types of real data. We start with three types of biomedical 
signals: gait pressure functions, PQRST complexes extracted from an ECG and 
respiration functions. For a detailed description of these datasets please see [15]. 
We proceed to show examples on growth velocity functions for boys and girls 
obtained from the Berkeley Growth Dataset (BGD) [40]. Finally, we show two 
examples on signature (tangential) acceleration functions from a subset of the 
data described in [44]. In each example, we first determine whether multiple 
modes exist in the posterior distribution of warping functions. If this is the 
case, we cluster the posterior samples using fc-means clustering, where k is se¬ 
lected based on the average silhouette measure. Finally, we show the clusterwise 
alignment results and assess registration uncertainty in each cluster. 

6.1. Biomedical Signals 

We describe several alignment examples for biomedical signals. In all of the 
presented datasets, the functions must first be properly registered to align im¬ 
portant features across the functional observations. At times, due to signifi¬ 
cant structural differences, registration ambiguities result in multiple plausible 
alignments, which cannot be detected using optimization-based registration al¬ 
gorithms. This is especially seen in the gait pressure functional data, which we 
consider in the first set of examples. 

Gait pressure functions: We begin with three examples of pairwise align¬ 
ment of gait pressure functions. In the first example, shown in Figure 5, we 
discover three modes in the posterior distribution. Panel (a) displays the regis¬ 
tration results using the mean warping function in each cluster. The functions 
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Fig 5. Pairwise alignment of two gait pressure functions, (a) Original functions fi and f 2 
in blue and red, respectively; f 2 o 7 DP in magenta, /2 o 71 in green (cluster 1), /2 o 72 in 
black (cluster 2) and /2 073 in cyan (cluster 3). (b) Jdp in magenta, in red, and 71 , 
72 and 73 in green, black and cyan, respectively, (c) Pointwise average of f\ and /2 for each 
alignment result (colored in the same way as (a) and (b)). (d) Pointwise standard deviation 
(hot colors correspond to higher values) plotted on 71 , and the 95% credible interval in black. 

(e) Pointwise standard deviation (hot colors correspond to higher values) plotted on /2 ° 7 i- 

(f) -(i) Same as (d) and (e) but for clusters 2 and 3, respectively. 


to be registered are plotted in blue and red. The deterministic dynamic pro¬ 
gramming solution is given in magenta, and the three clusterwise alignments 
are shown in green (cluster 1), black (cluster 2) and cyan (cluster 3). The cor¬ 
responding avereage warping functions are shown in panel (b) with the identity 
element in red. Panel (c) displays the pointwise average of the two functions 
for each registration result. First, the cluster 3 alignment is nearly identical to 
the dynamic programming solution. The main benefit of the proposed Bayesian 
approach is the discovery of two other plausible alignments. The cluster 1 align¬ 
ment emphasizes the first mode in the gait functions (green average in panel 
(c)) while the cluster 2 registration weights both modes equally (black average 
in panel (c)). The cluster 3 alignment as well as the dynamic programming 


imsart-ejs ver. 2011/11/15 file: ps-sample_vl.tex date: February 7, 2017 





















































S. Kurtek/Geometric Bayesian Alignment of Functional Data 


18 



Fig 6 . Pairwise alignment of two gait pressure functions, (a) Original functions fa and fa in 
blue and red, respectively; fa°'YDP in magenta, fao^i in green (cluster 1) and faofa in black 
(cluster 2). (b) 7 up in magenta, in red, and 71 and 72 in green and black, respectively, 
(c) Pointwise average of fa and fa for each alignment result (colored in the same way as (a) 
and (b)). (d) Pointwise standard deviation (hot colors correspond to higher values) plotted 
on 71 , and the 95% credible interval in black, (e) Pointwise standard deviation (hot colors 
correspond to higher values) plotted on fa 071 . (f)-(g) Same as (d) and (e) but for cluster 2. 


solution emphasize both modes as well as the large dip toward the midpoint 
of the gait cycle. Panels (d)-(i) show the uncertainty in each cluster using two 
displays: (1) pointwise standard deviation as a color (blue to red=low to high) 
on the mean warping as well as the pointwise 95% credible interval in black, and 
(2) pointwise standard deviation as a color on the warped version of the second 
function. We usually observe lower standard deviation along the pronounced 
features such as the steep increase and decrease in pressure at the beginning 
and end of the gait cycle. On the other hand, the standard deviation is inflated 
in flat regions where many types of warping provide a satisfactory solution. 

The second example is displayed in Figure 6. In this case, we find two modes 
in the posterior distribution and display the same set of results as for the first 
example. The results are similar as in the previous case where different modes 
of the pressure functions are emphasized in each cluster. Again, the cluster 2 
mean is very similar to the dynamic programming solution. Importantly, the 
result based on the proposed Bayesian model is always much smoother while 
achieving very similar alignment. Finally, in Figure 7, we display an example 
where the posterior distribution of warping functions is unimodal. In this case, 
the two functions to be aligned have two very clear gait pressure modes, and 
thus, there is little uncertainty in the registration. 

PQRST complexes: The PQRST complex in ECG refers to the first peak 
(P wave), the sharp second peak (QRS complex), and the third peak (T wave). 
These functions have very pronounced features, and thus, most of the pairwise 


imsart-ejs ver. 2011/11/15 file: ps-sample_vl.tex date: February 7, 2017 






























S. Kurtek/Geometric Bayesian Alignment of Functional Data 


19 




Fig 7. Pairwise alignment of two gait pressure functions, (a) Original functions /i and f 2 in 
blue and red, respectively; f 2 07 DP in magenta and /207 in green, (b) 7 /j p in magenta, 7 ,,/ 
in red, and 7 in green, (c) Pointwise average of fi and /2 for each alignment result (colored 
in the same way as (a) and (b)). (d) Pointwise standard deviation (hot colors correspond to 
higher values) plotted on 7 , and the 95% credible interval in black, (e) Pointwise standard 
deviation (hot colors correspond to higher values) plotted on /2 o 7 . 


alignment results on this data yield a unimodal posterior distribution. We dis¬ 
play one example of such an alignment in Figure 8. The posterior mean warping 
is very similar to the dynamic programming solution, albeit smoother. Also, 
there is very little registration uncertainty around the QRS complex. Align¬ 
ment uncertainty is also low at the T wave, which is much more pronounced 
than the P wave in this example. The red (no warping) pointwise average of the 
two PQRST complexes displayed in panel (c) is clearly not a valid PQRST com¬ 
plex. As a result, warping in this application is necessary to obtain reasonable 
functional summaries. 

Respiration data: Each function in this dataset represents lung volume 
during a breathing cycle. Respiration cycle alignment is important for under¬ 
standing breathing variation as well as radiotherapy in lung cancer [15]. In this 
application, the posterior distribution of warping functions is also almost al¬ 
ways unimodal due to the very simple structure of each breathing cycle. Figure 
9 displays one example of pairwise alignment of two such respiration functions. 
Again, the produced posterior mean alignment is very good, with little uncer¬ 
tainty in the area of the peak of the breathing cycle. 
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Fig 8 . Pairwise alignment of two PQRST complexes, (a) Original functions /i and /2 in 
blue and red, respectively; f 2 07 DP in magenta and /207 in green, (b) 7 d p in magenta, 7 ,,/ 
in red, and 7 in green, (c) Pointwise average of fi and /2 for each alignment result (colored 
in the same way as (a) and (b)). (d) Pointwise standard deviation (hot colors correspond to 
higher values) plotted on 7 , and the 95% credible interval in black, (e) Pointwise standard 
deviation (hot colors correspond to higher values) plotted on /2 o 7 . 


6.2. Berkeley Growth Velocity Data 

A major goal in studying the growth velocity functions of children is to char¬ 
acterize the number and timing of growth spurts in boys and girls. The BGD 
has been studied for these purposes before [22]. In the current paper, we em¬ 
phasize that there may be multiple plausible time warpings that align growth 
spurts across children. In the first example, presented in the top part of Figure 
10, we examine two growth velocity functions for boys. The resulting posterior 
distribution on the space of warping functions is bimodal. The mean warping 
in both clusters nicely aligns the large growth spurt. But, the average growth 
velocity patterns, as seen in panel (c), are quite different depending on which 
alignment is used. The cluster 1 alignment (green) results in a long constant 
velocity growth period in the average, while cluster 2 (black) results in a de¬ 
creasing velocity (at an approximately constant rate) during the same period. 
This presents two very different growth mechanisms, which are useful for char¬ 
acterizing growth functions. The second example, shown in the bottom portion 
of Figure 10, considers alignment of two growth velocity curves for girls. Again, 
we discover two modes in the posterior distribution. As seen in panel (c), the 
mean warping in cluster 1 (green) emphasizes the first growth spurt and is fol¬ 
lowed by a smaller second spurt. On the other hand, the mean alignment in 
cluster 2 results in an average growth pattern where the two growth spurts are 
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Fig 9. Pairwise alignment of two respiration functions, (a) Original functions fi and f 2 in 
blue and red, respectively; f 2 07 DP in magenta and /207 in green, (b) 7 d p in magenta, 7 ,,/ 
in red, and 7 in green, (c) Pointwise average of fi and /2 for each alignment result (colored 
in the same way as (a) and (b)). (d) Pointwise standard deviation (hot colors correspond to 
higher values) plotted on 7 , and the 95% credible interval in black, (e) Pointwise standard 
deviation (hot colors correspond to higher values) plotted on /2 o 7 . 


approximately of the same size. 

6.3. Signature Acceleration Functions 

The final application considers alignment of signature acceleration functions. 
As described in [25, 38], each planar signature curve is first summarized us¬ 
ing its tangential acceleration. Comparison and modeling of such functions are 
important in understanding inter and intra-class signature variability, and for 
signature classification. A major difficulty that arises in the analysis pipeline 
is that the signature acceleration functions contain natural warping variability. 
Thus, in order to obtain satisfactory results, such variability must be accounted 
for. We present two different pairwise registration results in Figures 11 and 12. 
In the first example, the posterior distribution of warping functions contains 
two different modes. The posterior mean alignment agrees for close to half of 
the time interval at which point the mean warping in cluster 2 (black) diverges 
from the identity warping. This results in two drastically different alignments 
between the two signatures (and potentially different inferences depending on 
which alignment is used). Another interesting feature is that there is a large 
amount of uncertainty in the region where the mean warping in cluster 2 di¬ 
verges from the identity element; this indicates that the corresponding region 
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Fig 10. Top: Pairwise alignment of two growth velocity functions for boys. Bottom: Pairwise 
alignment of two growth velocity functions for girls, (a) Original functions fi and fa in blue 
and red, respectively; fa ° 7 dp in magenta, fa o 71 in green (cluster 1) and fa o 72 in black 
(cluster 2). (b) 7 _dp in magenta, 7 ^ in red, and 71 and 72 in green and black, respectively, 
(c) Pointwise average of fa and fa for each alignment result (colored in the same way as (a) 
and (b)). (d) Pointwise standard deviation (hot colors correspond to higher values) plotted 
on 71 , and the 95% credible interval in black, (e) Pointwise standard deviation (hot colors 
correspond to higher values) plotted on fa 071 . (f)-(g) Same as (d) and (e) but for cluster 2. 


of the two acceleration functions is difficult to align. The posterior distribution 
in the second example is unimodal, and the posterior mean is very close to the 
dynamic programming solution. Furthermore, perhaps surprisingly, there is very 
little uncertainty in the alignment. 
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Fig 11. Pairwise alignment of two signature acceleration functions, (a) Original functions 
fl and /2 in blue and red, respectively; f 2 o 7 jp in magenta, ft o 71 in green (cluster 1) 
and f 2 o 72 in black (cluster 2). (b) 7 op in magenta, 7 id in red, and 71 and 72 in green 
and black, respectively, (c) Pointwise average of /1 and f 2 for each alignment result (colored 
in the same way as (a) and (b)). (d) Pointwise standard deviation (hot colors correspond to 
higher values) plotted on 71 , and the 95% credible interval in black, (e) Pointwise standard 
deviation (hot colors correspond to higher values) plotted on j '2 071 . (f)-(g) Same as (d)-(e) 
but for cluster 2. 


6 . 4 ■ Groupwise MAP Alignment to a Known Template 

We close the applications section with several examples of groupwise function 
alignment to a known template. For each of the datasets described above (and 
an additional simulated dataset), we randomly select one of the functions in the 
data as a template and align all functions in a pairwise manner to this tem¬ 
plate. In these examples, we do not account for multimodality in the posterior 
distribution and use the MAP warping ( 7 map ) for alignment. The results are 
presented in Figures 13 and 14. For each example, we display the full origi¬ 
nal dataset with the template highlighted in black in panel (a). We show the 
aligned data in panel (b), the pointwise function averages before (red) and after 
(green) alignment in panel (c), and the estimated warping functions in panel 
(d). In all examples, we see a drastic improvement in function alignment using 
the proposed method, which directly translates to better data summaries such 
as the pointwise function averages. As a specific example, consider the PQRST 
complexes in row (3). The MAP alignment is able to correctly match the P 
waves, QRS complexes and T waves across all of the given data. This results 
in an accurate representation of the pointwise average, which shares all of the 
features present in the original data. On the other hand, the QRS complex is 
highly distorted in the unaligned pointwise average. Similar results are observed 
in the other examples with highest improvement for the gait pressure data and 
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Fig 12. Pairwise alignment of two signature acceleration functions, (a) Original functions 
/l and /2 in blue and red, respectively; fi o 733 P in magenta and / 207 m green, (b) 7/3 P in 
magenta, v lf [ in red, and 7 in green, (c) Pointwise average of fi and J '2 for each alignment 
result (colored in the same way as (a) and (b)). (d) Pointwise standard deviation (hot colors 
correspond to higher values) plotted on 7 , and the 95% credible interval in black, (e) Pointwise 
standard deviation (hot colors correspond to higher values) plotted on /2 07 . 


the signature acceleration data. 

7. Summary and Future Work 

We have presented a Bayesian model for pairwise registration of functional data. 
This model utilizes a convenient geometric representation of warping functions 
called the square-root density, which allows for efficient sampling from the pos¬ 
terior distribution via importance sampling. A main advantage of the proposed 
approach over previous optimization-based approaches is that it is possible to 
discover multiple plausible registrations, which are given by different modes in 
the posterior distribution. We present several simulated and real data examples 
that highlight these advantages. We use simulations to compare the results ob¬ 
tained using the proposed model to those obtained using a similar model with 
a Dirichlet process prior on the warping functions (which does not exploit the 
geometry of the space of warping functions). 

There are multiple directions for future work. First, we will extend these 
methods to a groupwise registration model where the template function and the 
warping functions are estimated jointly. Second, we will extend these methods 
to a setting where soft landmark information is provided on the functions of 
interest. In such a case, one can incorporate this information into the prior dis¬ 
tribution of the Bayesian model. Third, we will consider a more general problem 
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Fig 13. Groupwise MAP alignment to a known template, (a) Original functions with the 
template in black, (b) Aligned functions (f o'Imap) with the template in black, (c) Pointwise 
average before (red) and after alignment (green), (d) Warping functions 'jmap ■ (1) Simulated 
data. (2) Gait pressure functions. (3) PQRST complexes. (() Respiration functions. 


of curve alignment for shape analysis where the curves are functions from a unit 
interval (open curves) or unit circle (closed curves) to n > 1. Shapes of ob¬ 
jects are invariant to translation, scale, rotation and re-parameterization, and 
thus, the prior distributions in our Bayesian model will be defined on product 
spaces, whose geometric structure will play an important role. Finally, a major 
question relates to propagating the registration uncertainty to subsequent sta¬ 
tistical inference problems. One example is template estimation in the presence 
of multiple plausible warping solutions. 
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Fig 14. Groupwise MAP alignment to a known template, (a)-(d) Same as in Figure 13. (5) 
BGD for boys. (6) BGD for girls. (7) Signature acceleration data. 
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